Handling Missing Data
Types of missing data
Understanding why data is missing
- Missing Completely At Random (MCAR): data are missing entirely by chance, independent of any observed or unobserved values
- Missing At Random (MAR): missingness depends only on the observed values of other features, not on the missing value itself
  - example: older individuals have fewer data records
- Missing Not At Random (MNAR): the probability of being missing depends on the missing value itself or on other unobserved factors
  - example: patients with severe symptoms drop out of a clinical trial.
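The three mechanisms can be made concrete with a small simulation. This is only a sketch under made-up assumptions: the variable names (`ages`, `incomes`), the drop probabilities, and the median cut-offs are all hypothetical and chosen for illustration, and income is generated independently of age to keep the example short.

```python
import random


def simulate_missingness(ages, incomes, seed=0):
    """Return three copies of `incomes` with None marking missing entries."""
    rng = random.Random(seed)
    median_age = sorted(ages)[len(ages) // 2]
    median_income = sorted(incomes)[len(incomes) // 2]

    mcar, mar, mnar = [], [], []
    for age, income in zip(ages, incomes):
        # MCAR: every value has the same 20% chance of being dropped.
        mcar.append(None if rng.random() < 0.2 else income)
        # MAR: the drop probability depends only on the observed age.
        p_mar = 0.4 if age > median_age else 0.05
        mar.append(None if rng.random() < p_mar else income)
        # MNAR: the drop probability depends on the income value itself.
        p_mnar = 0.4 if income > median_income else 0.05
        mnar.append(None if rng.random() < p_mnar else income)
    return mcar, mar, mnar


def observed_mean(values):
    """Mean of the non-missing entries."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)


# Hypothetical toy data: income is independent of age in this sketch.
rng = random.Random(42)
ages = [rng.uniform(20, 80) for _ in range(5000)]
incomes = [rng.gauss(50, 10) for _ in range(5000)]

mcar, mar, mnar = simulate_missingness(ages, incomes)
print("full:", observed_mean(incomes))
print("MCAR:", observed_mean(mcar))
print("MAR: ", observed_mean(mar))
print("MNAR:", observed_mean(mnar))
```

In this setup the observed mean under MCAR stays close to the full-data mean, while under MNAR it is pulled downward because high incomes are dropped more often. Under MAR the gaps cluster among older individuals, which can be checked directly from the observed `ages`.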
Handling missing data
| Method Type | Method | How it Works | Data Missing Type | Strengths | Key Risks / Notes |
|---|---|---|---|---|---|
| Deletion | List-wise deletion | Remove any row with ≥1 missing value | MCAR | Simple | Large data loss |
| Deletion | Pair-wise deletion | Use all available data per calculation (e.g., correlations) | MCAR | Retains more data than list-wise | Can produce inconsistent covariance matrices |
| Single Imputation (Deterministic) | Mean / Median / Mode Imputation | Use the mean or median for continuous values; use the mode for discrete values | MCAR | Simple | Underestimates variance; distorts correlations |
| Single Imputation (Deterministic) | Constant / Out-of-range value | Fill with a fixed value inside or outside the observed range | None (mechanism-agnostic) | Keeps missingness signal | Must add missing indicator; otherwise biased |
| Single Imputation + Indicator | Missing Indicator Method | Replace each missing value with zero and add a binary missingness flag; in regression, the model is extended with this response indicator | MAR | Captures missingness pattern | Can bias linear models; safer in trees |
| Single Imputation (Model-based) | Regression imputation | Predict each missing value from the other features using a fitted regression (or interpolation) model | MAR | Preserves relationships | Underestimates variance |
| Single Imputation (Model-based) | Stochastic regression imputation | Extends regression imputation: imputed value = regression prediction + random error | MAR | Preserves variance better | Still a single dataset (no pooling) |
| Time-Series Imputation | Last observation carried forward (LOCF) | Use the previous timepoint's value | Implicitly assumes no change | Simple for longitudinal data | Underestimates variability |
| Time-Series Imputation | Baseline observation carried forward (BOCF) | Use the baseline value for all missing timepoints | Implicitly assumes no change | Simple for longitudinal data | Underestimates dynamics |
| Time-Series Imputation | Interpolation (Linear / Spline) | Interpolate using adjacent timepoints | MAR + smoothness assumption | Good for dense time series | Fails for abrupt changes |
| Single Imputation (Similarity-based) | K-Nearest Neighbor imputation | Use similar samples to fill missing values | MAR | Non-parametric; captures local structure | Poor in high dimensions; ignores uncertainty |
| Multiple Imputation | Multiple Imputation by Chained Equations (MICE) | Iteratively model each variable conditionally on the others | MAR | Flexible; works with mixed data types | Computationally expensive |
| Multiple Imputation | Joint Modeling Multiple Imputation | Model full multivariate distribution (often MVN) | MAR | Statistically coherent | Strong distributional assumptions |
| Multiple Imputation | Multilevel / Hierarchical Multiple Imputation | MI including random effects for clustered/longitudinal data | MAR | Respects within-subject correlation | More complex implementation |
| Multiple Imputation | Predictive Mean Matching (PMM) | Impute by matching predicted values to observed donors | MAR | Produces realistic values; robust to non-normality | Needs sufficient donor pool |
| Likelihood-Based (No Imputation) | Mixed-Effects Model (FIML) | Fit longitudinal model directly using all available data (no explicit imputation) | MAR | Statistically efficient | Only works if analysis model supports it |
| Longitudinal Grid Imputation | Time-Raster Imputation | Insert regular time grid and impute at grid level | MAR | Handles irregular timepoints | Requires structured modeling |
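A few of the simpler rows in the table can be sketched in plain Python. The helper names below (`mean_impute_with_indicator`, `locf`, `linear_interpolate`) are hypothetical, missing values are represented as `None`, and the interpolation assumes evenly spaced timepoints; this is an illustration of the ideas, not a reference implementation.

```python
from statistics import mean


def mean_impute_with_indicator(values):
    """Fill None with the observed mean and return (filled, missing_flags)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


def locf(series):
    """Last observation carried forward; leading gaps stay None."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out


def linear_interpolate(series):
    """Fill interior gaps on an evenly spaced grid; edge gaps stay None."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    # Walk over each pair of neighbouring observed points and fill between them.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = series[left] + step * (i - left)
    return out


series = [1.0, None, None, 4.0, None, 6.0]
filled, flags = mean_impute_with_indicator(series)
print(filled, flags)
print(locf(series))                # [1.0, 1.0, 1.0, 4.0, 4.0, 6.0]
print(linear_interpolate(series))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Mean imputation places every gap at the same value, which is why the table lists underestimated variance as its key risk. LOCF produces flat plateaus, and interpolation only makes sense when the series changes smoothly between observations.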
Stats Behind: Model-Based Imputation Frameworks
There are three major statistical frameworks for handling missing data in model-based settings.
Joint Modeling (JM)
- model the full multivariate distribution directly
- impute from the joint distribution
- Examples:
- Multivariate Normal Multiple Imputation
- Bayesian joint models
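A minimal sketch of the joint-modeling idea for two variables, assuming a bivariate normal model where only `y` has gaps. The function names are hypothetical. As a simplification, the parameters are estimated once from the complete rows and then held fixed; a proper multivariate normal multiple imputation would also draw the parameters themselves (for example via data augmentation), so this sketch understates the uncertainty.

```python
import random
from statistics import mean


def fit_bivariate_normal(pairs):
    """Estimate means, variances and covariance from complete (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    n = len(pairs)
    vx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    vy = sum((y - my) ** 2 for y in ys) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
    return mx, my, vx, vy, cxy


def draw_y_given_x(x, params, rng):
    """Draw y from the conditional normal y | x implied by the joint model."""
    mx, my, vx, vy, cxy = params
    cond_mean = my + cxy / vx * (x - mx)
    cond_var = vy - cxy ** 2 / vx
    return rng.gauss(cond_mean, max(cond_var, 0.0) ** 0.5)


def jm_impute(rows, m=5, seed=0):
    """Return m imputed copies of rows, where rows are (x, y) and y may be None."""
    rng = random.Random(seed)
    complete = [(x, y) for x, y in rows if y is not None]
    params = fit_bivariate_normal(complete)
    return [
        [(x, y if y is not None else draw_y_given_x(x, params, rng)) for x, y in rows]
        for _ in range(m)
    ]


# Hypothetical toy data: y depends linearly on x, about 30% of y is dropped.
rng = random.Random(1)
rows = []
for _ in range(200):
    x = rng.gauss(0, 1)
    y = 2 * x + rng.gauss(0, 0.5)
    rows.append((x, None if rng.random() < 0.3 else y))

datasets = jm_impute(rows, m=5)
print(len(datasets), "imputed datasets of", len(datasets[0]), "rows each")
```

Each of the `m` datasets keeps the observed values and differs only in the imputed ones; the spread across datasets is what later carries the imputation uncertainty into the analysis.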
Fully Conditional Specification (FCS)
- model each variable conditionally on others
- iterate through variables sequentially
- Examples:
- Multiple Imputation by Chained Equations
- MissForest (single imputation, non-parametric)
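The conditional, variable-by-variable cycle can be sketched for two numeric columns. This is a simplified illustration, not the MICE algorithm as implemented in any particular package: the function names are hypothetical, each conditional model is a simple linear regression with normal noise, and the regression coefficients are not themselves drawn from a posterior as a full implementation would do.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope, intercept and residual std of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return slope, intercept, sigma


def chained_imputation(x, y, n_iter=10, seed=0):
    """Return one imputed copy (x_filled, y_filled); None marks a gap in the input."""
    rng = random.Random(seed)
    original = {"x": list(x), "y": list(y)}
    cols = {}
    # Start from mean imputation so each conditional model has a full predictor.
    for name, col in original.items():
        col_mean = mean(v for v in col if v is not None)
        cols[name] = [col_mean if v is None else v for v in col]

    for _ in range(n_iter):
        # Visit each variable in turn, modelling it on the other one.
        for target, predictor in (("x", "y"), ("y", "x")):
            obs = [i for i, v in enumerate(original[target]) if v is not None]
            gaps = [i for i, v in enumerate(original[target]) if v is None]
            slope, intercept, sigma = fit_line(
                [cols[predictor][i] for i in obs],
                [cols[target][i] for i in obs],
            )
            for i in gaps:
                pred = intercept + slope * cols[predictor][i]
                cols[target][i] = pred + rng.gauss(0, sigma)
    return cols["x"], cols["y"]


# Hypothetical toy data with gaps in both columns.
rng = random.Random(7)
x_full = [rng.gauss(0, 1) for _ in range(300)]
y_full = [1.5 * v + rng.gauss(0, 0.5) for v in x_full]
x = [None if rng.random() < 0.2 else v for v in x_full]
y = [None if rng.random() < 0.2 else v for v in y_full]

# Different seeds give different imputed datasets (the "multiple" in MI).
imputations = [chained_imputation(x, y, seed=s) for s in range(5)]
print(len(imputations), "imputed datasets")
```

Running the cycle with several seeds gives several completed datasets. In a full multiple-imputation workflow, the analysis is run on each and the results are pooled.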
Likelihood-Based Methods (No Imputation)
- do not fill missing values
- estimate model parameters directly using all observed data
- Examples:
- Mixed-effects models with FIML
- Structural equation modeling with FIML
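The "use all observed data without filling anything in" idea can be illustrated with the observed-data log-likelihood of a bivariate normal model. This is a sketch under stated assumptions: the function names are hypothetical, the model has just two variables, and the maximization step (which a real FIML fit performs with a numerical optimizer) is left out.

```python
import math


def normal_logpdf(v, mu, var):
    """Log-density of a univariate normal with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mu) ** 2 / var)


def observed_data_loglik(rows, mx, my, vx, vy, cxy):
    """Sum each row's log-density over only the components it actually observed."""
    det = vx * vy - cxy ** 2
    total = 0.0
    for x, y in rows:
        if x is not None and y is not None:
            # Complete row: full bivariate normal density.
            dx, dy = x - mx, y - my
            quad = (vy * dx * dx - 2 * cxy * dx * dy + vx * dy * dy) / det
            total += -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
        elif x is not None:
            # Only x observed: marginal density of x.
            total += normal_logpdf(x, mx, vx)
        elif y is not None:
            # Only y observed: marginal density of y.
            total += normal_logpdf(y, my, vy)
        # Rows with nothing observed contribute no information.
    return total


rows = [(1.0, 2.0), (0.5, None), (None, -1.0), (None, None)]
print(observed_data_loglik(rows, mx=0.0, my=0.0, vx=1.0, vy=4.0, cxy=0.5))
```

No row is deleted and no value is imputed: a partially observed row simply contributes the marginal likelihood of what was seen. Mixed-effects and SEM software applies the same principle with the model-implied mean and covariance in place of the free parameters used here.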